Knight Capital Failure Due to Development Bug
Learn how a trading firm lost $440 million in 45 minutes due to a software bug.
On August 1, 2012, a trading firm named Knight Capital lost $440 million due to what it referred to as a "trading glitch." The Knight Capital example is of particular interest because the company took 17 years of hard work to build, yet it nearly went bankrupt in roughly 45 minutes. A simple software update caused Knight Capital to lose 75% of its value in merely 48 hours.
Knight Capital offered a rich API providing a variety of trading functionalities to its users. Unfortunately for Knight, it didn't last, because of a major development bug. Let's see how this failure was caused by an update to Knight Capital's trading algorithm.
How did it happen?#
This section details the sequence of events that caused the failure:
Knight's software development team wanted to update its trading execution system, called the Smart Market Access Routing System (SMARS), within a short span of one month to complete the development cycle.
The development team had to deploy the updated trading algorithm to eight production servers that contained outdated dead test code dating back to 2003. This dead test code, called Power Peg, had previously been used for quality assurance (QA) purposes and was activated by a flag. Knight repurposed Power Peg's activation flag to execute the updated SMARS code.
The deployment team forgot to deploy the updated trading algorithm to one of the eight production servers. At the same time, the repurposed flag was activated on all eight production servers.
As a result, the flag activated the execution of Power Peg on the one server that still had the old code. Power Peg was test code that placed orders buying stocks at higher prices and selling them at lower prices, which caused disastrous losses for the trading firm.
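The failure mode described above can be sketched in a few lines. This is a hypothetical illustration, not Knight's actual code: the function name, the flag, and the routing strings are all assumptions made purely to show how one repurposed flag behaves differently on a server that missed the deployment.

```python
# Hypothetical sketch: a repurposed activation flag triggers dead code
# on the one server that missed the deployment. All names and logic are
# illustrative assumptions, not Knight Capital's actual implementation.

def route_order(order, flag_enabled, has_new_code):
    """Decide which code path handles an order on a given server."""
    if not flag_enabled:
        return "inactive"
    if has_new_code:
        return "SMARS"      # updated servers: flag runs the new routing code
    return "Power Peg"      # stale server: the same flag activates dead test code

# The flag is switched on for ALL servers, but one server missed the deploy.
servers = [{"has_new_code": i != 7} for i in range(8)]
paths = [route_order("BUY 100 XYZ", True, s["has_new_code"]) for s in servers]
print(paths.count("SMARS"), "servers run SMARS;",
      paths.count("Power Peg"), "server runs dead Power Peg code")
```

The point of the sketch: the flag's meaning silently depends on which code version a server runs, so a single missed deployment flips it from "run the new feature" to "run decade-old test code."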
Analysis and takeaways#
Need for standardization: It'd be easy to blame the engineer or engineering team for a lack of attention during the code update process. However, such scenarios are inevitable when modern software development and operations practices (DevOps) are not in place.
Prune old code: Keeping almost decade-old test code on production servers is a mistake, and repurposing its activation flags to execute new features compounds it.
Move towards automation: Knight relied on manual deployment of code to production, which invites exactly the kind of inconsistency that later requires troubleshooting. Such processes should be automated.
Stringent timelines: The development team was racing to develop and deploy the updated trading system within one month. This is one reason why not all required checks were in place.
Develop incident response guidelines: No incident response mechanisms were in place to tackle such situations. Once the problem was noticed, Knight attempted to resolve it by uninstalling the updated trading algorithm from the other seven production servers. This worsened the situation: with the repurposed flag still active, those seven servers also started behaving like the eighth server.
Need for risk-management processes: Knight had no effective risk-management processes in place to catch runaway trading behavior like this.
Mitigation techniques to employ#
Using standard software development procedures is one thing, but using tools to perform the subtasks in each phase of software development is also important. We list some important practices below:
Use version control software such as Git, with a hosting platform like GitHub.
Sandbox test code instead of putting it on production servers.
Perform regression and unit testing before deploying to production. Code reviews are also critical.
Enable automated deployment of code.
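The automation point above can be made concrete with a post-deployment consistency check: before a release is considered complete, verify that every production server reports the same build version. This is a minimal sketch under assumed names; the version strings, hostnames, and `verify_deployment` helper are invented for illustration and are not part of any real deployment tool.

```python
# Hypothetical post-deployment consistency check. A release is only
# "done" when every production server reports the expected build.
# All hostnames and version strings below are made-up examples.

EXPECTED_VERSION = "smars-2012.08.01"

def verify_deployment(server_versions):
    """Return the list of servers that do NOT run the expected build."""
    return [host for host, version in server_versions.items()
            if version != EXPECTED_VERSION]

# One of eight servers was skipped during a manual rollout.
reported = {f"prod-{i}": EXPECTED_VERSION for i in range(1, 8)}
reported["prod-8"] = "power-peg-2003"   # stale server still on old code

stale = verify_deployment(reported)
print("stale servers:", stale if stale else "none")
```

An automated pipeline would run such a check after every rollout and refuse to activate any feature flag while the check reports stale servers; a human doing eight manual deployments has no equivalent safety net.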
Beyond adhering to software development life cycle standards, monitoring the system's performance from established vantage points is vital. A monitoring and alerting system could have cut much of the loss, especially in Knight's case, where a large number of orders were processed in a short period.
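As a sketch of the kind of alerting that could have limited the damage, the snippet below flags any interval whose order count exceeds a configured threshold. The class name, the threshold, and the order counts are illustrative assumptions, not Knight's real figures or tooling.

```python
# Hypothetical monitoring sketch: alert when the order volume in one
# interval exceeds a configured threshold. Thresholds and counts are
# made-up illustrative values.

from collections import deque

class OrderRateMonitor:
    """Track order counts over a sliding window of intervals and flag spikes."""

    def __init__(self, threshold_per_interval, window=5):
        self.threshold = threshold_per_interval
        self.history = deque(maxlen=window)   # keep recent intervals for context

    def record_interval(self, order_count):
        """Record one interval's order count; return True if it spikes."""
        self.history.append(order_count)
        return order_count > self.threshold

monitor = OrderRateMonitor(threshold_per_interval=1_000)
for count in [300, 450, 12_000]:            # sudden runaway burst
    if monitor.record_interval(count):
        print(f"ALERT: {count} orders in one interval exceeds threshold")
```

Wired to an automated kill switch, an alert like this could halt trading within seconds instead of letting a runaway algorithm run for 45 minutes.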
In this lesson, we learned how a simple development and code deployment mistake can lead to a failure that brings down an entire company.